Substitute specification:



METHOD AND ARRANGEMENT FOR DETERMINING
A SEQUENCE OF ACTIONS FOR A SYSTEM



BACKGROUND OF THE INVENTION 



Field of the Invention: 

This invention generally pertains to systems having states, and in particular to 
methods for determining a sequence of actions for such systems. 

Discussion of the Related Art: 

A generalized method and arrangement for determining a sequence of
actions for a system having states, wherein a transition in state between two states
is performed on the basis of an action, is discussed by Neuneier in "Enhancing
Q-Learning for Optimal Asset Allocation", appearing in the Proceedings of the Neural
Information Processing Systems, NIPS 1997. Neuneier describes a financial market
as an example of a system which has states. His system is described as a Markov
Decision Problem (MDP).

The characteristics of a Markov Decision Problem are summarized below:

    X                        set of possible states of the system, e.g. X = ℝ^m,
    A(x_t)                   set of possible actions in the state x_t,
    p(x_{t+1} | x_t, a_t)    transition probability,
    r(x_t, a_t, x_{t+1})     gain with expectation R(x_t, a_t).



Starting from observable variables, denoted below as training data, the aim
is to determine a strategy, that is to say a sequence of functions

    \pi = \{\mu_0, \mu_1, \ldots, \mu_T\}                                    (3)

which at each instant t map each state into an action rule, that is to say an action

    \mu_t(x_t) = a_t.                                                        (4)

Such a strategy is evaluated by an optimization function.

The optimization function specifies the expectation of the gains accumulated
over time for a given strategy π and a start state x_0.

The so-called Q-learning method is described by Neuneier as an example of
a method of approximative dynamic programming.

An optimum evaluation function V*(x) is defined by

    V^{*}(x) = \max_{\pi} V^{\pi}(x) \quad \forall x \in X                   (5)

where

    V^{\pi}(x) = E\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r(x_t, \mu_t, x_{t+1}) \,\middle|\, x_0 = x \right],    (6)

γ denoting a prescribable reduction factor which is formed in accordance with the
following rule:

    z \in \mathbb{R}^{+}.



A Q-evaluation function Q*(x_t, a_t) is formed within the Q-learning method for
each pair (state x_t, action a_t) in accordance with the following rule:

    Q^{*}(x_t, a_t) := \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t)\, r_t
                       + \gamma \sum_{x_{t+1} \in X} p(x_{t+1} \mid x_t, a_t) \max_{a \in A} Q^{*}(x_{t+1}, a)    (9)

On the basis of each tuple (x_t, x_{t+1}, a_t, r_t), the Q-values Q*(x, a) are
adapted in the (k+1)-th iteration with a prescribed learning rate η_k in accordance
with the following learning rule:

    Q_{k+1}(x_t, a_t) = (1 - \eta_k)\, Q_k(x_t, a_t) + \eta_k \left( r_t + \gamma \max_{a \in A} Q_k(x_{t+1}, a) \right)    (10)
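As an illustration of learning rule (10), the following minimal Python sketch applies the update to a tabular estimate of the Q-values. The dictionary-based table, the function name q_update and the two example actions are assumptions made for this sketch; they do not appear in Neuneier's description.

```python
from collections import defaultdict


def q_update(Q, x, a, r, x_next, actions, eta, gamma):
    """Apply learning rule (10) once to the table Q and return the new value."""
    # Best Q-value reachable from the subsequent state under the current estimate.
    best_next = max(Q[(x_next, b)] for b in actions)
    Q[(x, a)] = (1.0 - eta) * Q[(x, a)] + eta * (r + gamma * best_next)
    return Q[(x, a)]


Q = defaultdict(float)       # every Q-value starts at 0.0
actions = ["hold", "buy"]    # hypothetical action set
first = q_update(Q, x=0, a="buy", r=1.0, x_next=1, actions=actions, eta=0.5, gamma=0.9)
second = q_update(Q, x=1, a="hold", r=2.0, x_next=0, actions=actions, eta=0.5, gamma=0.9)
```

The second call already benefits from the first: the value learned for the pair (0, "buy") enters the maximum over the subsequent state.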



Usually, the so-called Q-values Q*(x, a) are approximated for the various actions by
a function approximator in each case, for example a neural network or a polynomial
classifier, with a weighting vector w^a which contains the weights of the function
approximator.



A function approximator is, for example, a neural network, a polynomial 
classifier or a combination of a neural network with a polynomial classifier. 



It therefore holds that:

    Q^{*}(x, a) \approx Q(x; w^{a}).                                         (11)



Changes in the weights in the weighting vector w^a are based on a temporal
difference d_t which is formed in accordance with the following rule:

    d_t := r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_k^{a}) - Q(x_t; w_k^{a_t})    (12)

The following adaptation rule for the weights of the neural network, which are
included in the weighting vector w^a, follows for the Q-learning method with the use of
a neural network:

    w_{k+1}^{a_t} = w_k^{a_t} + \eta_k \cdot d_t \cdot \nabla Q(x_t; w_k^{a_t}).    (13)

The neural network representing the system of a financial market as described 
by Neuneier is trained using the training data which describe information on changes 
in prices on a financial market as time series values. 

A further method of approximative dynamic programming is the so-called TD(λ)
learning method. This method is discussed in R.S. Sutton, "Learning To Predict By
The Method Of Temporal Differences", appearing in Machine Learning, Vol. 3,
pages 9-44, 1988.

Furthermore, it is known from M. Heger, "Risk and Reinforcement Learning:
Concepts and Dynamic Programming", ZKW Bericht No. 8/94, Zentrum für
Kognitionswissenschaften [Center for Cognitive Sciences], Bremen University,
December 1994, that a risk is associated with a strategy π and an initial
state x_t. A method for risk avoidance is also discussed by Heger, cited above.

The following optimization function, which is also referred to as an expanded
Q-function Q^π(x_t, a_t), is used in the Heger method:

                                                                             (14)

The expanded Q-function Q^π(x_t, a_t) describes the worst case if the action a_t is
executed in the state x_t and the strategy π is followed thereupon.



The optimization function Q*(x_t, a_t), defined as

    Q^{*}(x_t, a_t) := \max_{\pi} Q^{\pi}(x_t, a_t),                         (15)

is given by the following rule:

    Q^{*}(x_t, a_t) = \min_{x_{t+1} \in X,\; p(x_{t+1} \mid x_t, a_t) > 0} \left[ r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q^{*}(x_{t+1}, a) \right]    (16)

A substantial disadvantage of this mode of procedure is that only the worst
case is taken into account when finding the strategy. However, this inadequately
reflects the requirements of the most varied technical systems.

In "Dynamic Programming and Optimal Control", Athena Scientific, Belmont,
MA, 1995, D. P. Bertsekas formulates access control for a communications network
and routing within the communications network as a problem of dynamic
programming.

Therefore, the present invention is based on the problem of specifying a
method and a system for determining a sequence of actions which achieve
increased flexibility in determining the required strategy.

In a method for computer-aided determination of a sequence of actions for a 
system which has states, a transition in state between two states being performed 
on the basis of an action, the determination of the sequence of actions is performed 
in such a way that a sequence of states resulting from the sequence of actions is 
optimized with regard to a prescribed optimization function, the optimization function 
including a variable parameter with the aid of which it is possible to set a risk which 
the resulting sequence of states has with respect to a prescribed state of the system. 

A system for determining a sequence of actions for a system which has 
states, a transition in state between two states being performed on the basis of an 
action, has a processor which is set up in such a way that the determination of the 



sequence of actions can be performed in such a way that a sequence of states 
resulting from the sequence of actions is optimized with regard to a prescribed 
optimization function, the optimization function including a variable parameter with 
the aid of which it is possible to set a risk which the resulting sequence of states has 
with respect to a prescribed state of the system. 

Thus, the present invention offers a method for determining a sequence of 
actions at a freely prescribable level of accuracy when finding a strategy for a 
possible closed-loop control or open-loop control of the system, in general for 
influencing it. Hence, the embodiments described below are valid both for the 
method and for the system. 

Approximative dynamic programming is used for the purpose of determination,
for example a method based on Q-learning or a method based on
TD(λ)-learning.

Within Q-learning, the optimization function OFQ is preferably formed in
accordance with the following rule:

    OFQ = Q(x; w^{a}),

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• w^a denoting the weights of a function approximator which belong to the action a.

The following adaptation step is executed during Q-learning in order to
determine the optimum weights w^a of the function approximator:

    w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot N^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})

with the abbreviation

    d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}),

• x_t, x_{t+1} respectively denoting a state in the state space X,

• a_t denoting an action from an action space A,

• γ denoting a prescribable reduction factor,

• w_t^{a_t} denoting the weighting vector associated with the action a_t before the
adaptation step,

• w_{t+1}^{a_t} denoting the weighting vector associated with the action a_t after the
adaptation step,

• η_t (t = 1, ...) denoting a prescribable step size sequence,

• κ ∈ [-1, 1] denoting a risk monitoring parameter,

• N^κ denoting a risk monitoring function N^κ(ξ) = (1 - κ sign(ξ)) ξ,

• ∇Q( ; ) denoting the derivative of the function approximator with respect to its
weights, and

• r(x_t, a_t, x_{t+1}) denoting a gain upon the transition of state from the state x_t to the
subsequent state x_{t+1}.
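The adaptation step above can be sketched in Python as follows. This is a minimal sketch under the assumption of a linear function approximator Q(x; w^a) = w^a · x, whose derivative with respect to the weights is simply x; the names risk_transform, q_value and risk_sensitive_q_step, as well as the action labels, are hypothetical and chosen only for illustration.

```python
def risk_transform(d, kappa):
    """Risk monitoring function N^kappa(d) = (1 - kappa * sign(d)) * d."""
    sign = (d > 0) - (d < 0)
    return (1.0 - kappa * sign) * d


def q_value(w, x):
    """Assumed linear approximator Q(x; w) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def risk_sensitive_q_step(weights, x, a, r, x_next, eta, gamma, kappa):
    """One adaptation step; weights maps each action to its weighting vector."""
    best_next = max(q_value(w, x_next) for w in weights.values())
    d = r + gamma * best_next - q_value(weights[a], x)   # temporal difference d_t
    scale = eta * risk_transform(d, kappa)
    # For the linear approximator the gradient with respect to w^a is x itself.
    weights[a] = [wi + scale * xi for wi, xi in zip(weights[a], x)]
    return d


weights = {"accept": [0.0, 0.0], "reject": [0.0, 0.0]}
d = risk_sensitive_q_step(weights, x=[1.0, 0.0], a="accept", r=1.0,
                          x_next=[0.0, 1.0], eta=0.1, gamma=0.9, kappa=0.5)
```

With κ = 0.5 the positive temporal difference is halved before it reaches the weights, whereas a negative one of the same magnitude would be multiplied by 1.5.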

The optimization function is preferably formed in accordance with the following
rule within the TD(λ)-learning method:

    OFTD = J(x; w),

• x denoting a state in a state space X,

• a denoting an action from an action space A, and

• w denoting the weights of a function approximator.

The following adaptation step is executed during TD(λ)-learning in order to
determine the optimum weights w of the function approximator:

    w_{t+1} = w_t + \eta_t \cdot N^{\kappa}(d_t) \cdot z_t

with the abbreviations

    d_t = r(x_t, a_t, x_{t+1}) + \gamma J(x_{t+1}; w_t) - J(x_t; w_t),

    z_t = \lambda \cdot \gamma \cdot z_{t-1} + \nabla J(x_t; w_t),

    z_{-1} = 0,

• x_t, x_{t+1} respectively denoting a state in the state space X,

• a_t denoting an action from an action space A,

• γ denoting a prescribable reduction factor,

• w_t denoting the weighting vector before the adaptation step,

• w_{t+1} denoting the weighting vector after the adaptation step,

• η_t (t = 1, ...) denoting a prescribable step size sequence,

• κ ∈ [-1, 1] denoting a risk monitoring parameter,

• N^κ denoting a risk monitoring function N^κ(ξ) = (1 - κ sign(ξ)) ξ,

• ∇J( ; ) denoting the derivative of the function approximator with respect to its
weights, and

• r(x_t, a_t, x_{t+1}) denoting a gain upon the transition of state from the state x_t to the
subsequent state x_{t+1}.
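The TD(λ) adaptation step above can be sketched in the same way. The sketch again assumes a linear approximator J(x; w) = w · x, so that the derivative with respect to the weights equals x; the function names and the numerical values are hypothetical.

```python
def risk_transform(d, kappa):
    """Risk monitoring function N^kappa(d) = (1 - kappa * sign(d)) * d."""
    sign = (d > 0) - (d < 0)
    return (1.0 - kappa * sign) * d


def td_lambda_step(w, z, x, r, x_next, eta, gamma, lam, kappa):
    """Return (w_{t+1}, z_t, d_t) for an assumed linear approximator J(x; w) = w . x."""
    def j(s):
        return sum(wi * si for wi, si in zip(w, s))

    d = r + gamma * j(x_next) - j(x)                             # temporal difference d_t
    z_new = [lam * gamma * zi + xi for zi, xi in zip(z, x)]      # eligibility trace z_t
    step = eta * risk_transform(d, kappa)
    w_new = [wi + step * zi for wi, zi in zip(w, z_new)]
    return w_new, z_new, d


w0, z_init = [0.0, 0.0], [0.0, 0.0]          # z_{-1} = 0
w1, z0, d0 = td_lambda_step(w0, z_init, x=[1.0, 0.0], r=-1.0, x_next=[0.0, 1.0],
                            eta=0.1, gamma=0.9, lam=0.8, kappa=0.5)
```

Here the temporal difference is negative, so with κ = 0.5 it is amplified by the factor 1.5 before it is applied along the trace.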

SUMMARY OF THE INVENTION 

It is an object of the present invention to provide a technical system and
method for determining a sequence of actions using measured values.

It is another object of the present invention to provide a technical system and 
method that can be subjected to open-loop control or closed-loop control with the 
use of a determined sequence of actions. 

It is a further object of the invention to provide a technical system and method 
modeled as a Markov Decision Problem. 

It is an additional object of the invention to provide a technical system and 
method that can be used in a traffic management system. 

It is yet another object of the invention to provide a technical system and 
method that can be used in a communications system, such that a sequence of 
actions is used to carry out access control, routing or path allocation. 

It is yet a further object of the invention to provide a technical system and
method for a financial market modeled by a Markov Decision Problem, wherein a
change in an index of stocks, or a change in a rate of exchange on a foreign
exchange market, makes it possible to intervene in the market in accordance with a
sequence of determined actions.



These and other objects of the invention will be apparent from a careful review
of the following detailed description of the preferred embodiments, which is to be read in
conjunction with a review of the accompanying drawing figures.



BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 shows a flowchart of method steps according to the present invention;

Figure 2 shows a system modeled as a Markov Decision Problem;

Figure 3 shows a communications network wherein access control is carried out
in a switching unit according to the present invention;

Figure 4 shows a function approximator for approximative dynamic
programming according to the present invention;

Figure 5 shows a plurality of function approximators for approximative dynamic
programming according to the present invention; and

Figure 6 shows a traffic management system subjected to closed-loop control in
accordance with the present invention.



DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 



Figure 1 shows a flowchart according to the present invention, in which
individual method steps of a first embodiment are provided, which will be discussed
later.

Figure 2 shows the structure of a typical Markov Decision Problem method. 

The system 201 is in a state x_t at an instant t. The state x_t can be observed by
an observer of the system. On the basis of an action a_t from the set of actions
possible in the state x_t, a_t ∈ A(x_t), the system makes a transition with a certain
probability into a subsequent state x_{t+1} at a subsequent instant t+1.

As illustrated diagrammatically in Figure 2 by a loop, an observer 200 
perceives 202 observable variables concerning the state x t and takes a decision via 
an action 203 with which it acts on the system 201 . The system 201 is usually 
subject to the interference 205. 

The observer 200 obtains a gain r_t 204

    r_t = r(x_t, a_t, x_{t+1}) \in \mathbb{R},                               (1)

which is a function of the action a_t 203 and the original state x_t at the instant t as well
as of the subsequent state x_{t+1} of the system at the subsequent instant t+1.

The gain r t can assume a positive or negative scalar value depending on 
whether the decision leads, with regard to a prescribable criterion, to a positive or 
negative system development, to an increase in capital stock or to a loss. 



In a further time step, the observer 200 of the system 201 decides on the
basis of the observable variables 202, 204 of the subsequent state x_{t+1} in favor of a
new action a_{t+1}, etc.



A sequence of

    State:              x_t ∈ X
    Action:             a_t ∈ A(x_t)
    Subsequent state:   x_{t+1} ∈ X
    Gain:               r(x_t, a_t, x_{t+1})

describes a trajectory of the system which is evaluated by a performance criterion
which accumulates the individual gains r_t over the instants t. It is assumed by way of
simplification in a Markov Decision Problem that the state x_t and the action a_t
contain all the information needed to describe a transition probability p(x_{t+1} | ·) of
the system from the state x_t to the subsequent state x_{t+1}.

In formal terms, this means that:

    p(x_{t+1} \mid x_t, \ldots, x_0, a_t, \ldots, a_0) = p(x_{t+1} \mid x_t, a_t).    (2)

p(x_{t+1} | x_t, a_t) denotes a transition probability for the subsequent state x_{t+1} for a given
state x_t and given action a_t.

In a Markov Decision Problem, future states of the system 201 are thus not a
function of states and actions which lie further in the past than one time step.



Figure 3 shows an embodiment of the present invention involving an access 
control and routing system, such as a communications network 300. 

The communications network 300 has a multiplicity of switching units 301a,
301b, ..., 301i, ..., 301n, which are interconnected via connections 302a, 302b, ..., 302j,
..., 302m. A first terminal 303 is connected to a first switching unit 301a. From the first
terminal 303, the first switching unit 301a is sent a request message 304 which
requests reservation of a prescribed bandwidth within the communications network
300 for the purpose of transmitting data, such as video data or text data.

It is determined in the first switching unit 301a, in accordance with a strategy
described below, whether the requested bandwidth is available in the
communications network 300 on a specified, requested connection in step 305. The request is
refused in step 306 if this is not the case. If sufficient bandwidth is available, it is
checked in checking step 307 whether the bandwidth can be reserved.

The request is refused in step 308 if this is not the case. Otherwise, the first 
switching unit 301a selects a route from the first switching unit 301a via further 
switching units 301 i to a second terminal 309 with which the first terminal 303 wishes 
to communicate, and a connection is initialized in step 310. 

The starting point below is a communications network 300 which comprises a
set of switching units

    N = \{1, \ldots, n, \ldots, N\}                                          (17)

and a set of physical connections

    L = \{1, \ldots, l, \ldots, L\},                                         (18)

a physical connection l having a capacity of B(l) bandwidth units.



A set

    M = \{1, \ldots, m, \ldots, M\}                                          (19)

of different types of service m is available, a type of service m being characterized
by

• a bandwidth requirement b(m),

• an average connection time 1/ν(m), and

• a gain c(m) which is obtained whenever a call request of the corresponding
type of service m is accepted.

The gain c(m) is given by the amount of money which a network operator of 
the communications network 300 bills a subscriber for a connection of the type of 
service. Clearly, the gain c(m) reflects different priorities, which can be prescribed by 
the network operator and which he associates with different services. 

A physical connection l can simultaneously provide any desired combination
of communications connections as long as the bandwidth used for the
communications connections does not exceed the bandwidth available overall for the
physical connection.

If a new communications connection of type m is requested between a first
node i and a second node j (terminals are also denoted as nodes), the requested
communications connection can, as represented above, either be accepted or be
refused. If the communications connection is accepted, a route is selected from a set
of prescribed routes. This selection is denoted as routing. A communications
connection of type m uses b(m) bandwidth units on each physical connection along
the selected route for the duration of the connection.
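The bandwidth condition described above can be illustrated with a short Python sketch. It assumes that a route is given as a list of physical connections and that the used and total bandwidth per physical connection are kept in dictionaries; the names route_is_feasible, reserve, used and capacity, and the numbers, are hypothetical.

```python
def route_is_feasible(route, used, capacity, demand):
    """True if every physical connection l on the route can carry `demand` more units."""
    return all(used[l] + demand <= capacity[l] for l in route)


def reserve(route, used, demand):
    """Book b(m) = demand bandwidth units on each physical connection of the route."""
    for l in route:
        used[l] += demand


capacity = {"l1": 10, "l2": 4}     # B(l) for two hypothetical physical connections
used = {"l1": 3, "l2": 3}          # bandwidth units currently in use
via_both = route_is_feasible(["l1", "l2"], used, capacity, demand=2)
via_l1 = route_is_feasible(["l1"], used, capacity, demand=2)
if via_l1:
    reserve(["l1"], used, demand=2)
```

A route is only as good as its tightest physical connection: the first route fails because l2 has a single free unit, while the second can be reserved.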

Thus, during access control, also referred to as call admission control, a route 
can be selected within the communications network 300 only when the selected 
route has sufficient bandwidth available. The aim of the access control and of the 
routing is to maximize a long term gain which is obtained by acceptance of the 
requested connections. 

At an instant t, the technical system which is the communications network 300 
is in a state x t which is described by a list of routes via existing connections, by 
means of which lists it is shown how many connections of which type of service are 
using the respective routes at the instant t. 

Events ω, by means of which a state x_t could be transferred into a subsequent
state x_{t+1}, are the arrival of new connection request messages, or else the
termination of a connection existing in the communications network 300.

In this embodiment, an action a_t at an instant t, owing to a connection request,
is the decision as to whether the connection request is to be accepted or refused and,
if the connection is accepted, the selection of the route through the communications
network 300.

The aim is to determine a sequence of actions, that is to say to learn a
strategy with actions relating to a state x_t, in such a way that the following
rule is maximized:

    E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k}\, g(x_{t_k}, \omega_k, a_{t_k}) \right\}    (20)

E{.} denoting an expectation,

t_k denoting an instant at which a k-th event takes place,

g(x_{t_k}, ω_k, a_{t_k}) denoting the gain which is associated with the k-th event, and

β denoting a reduction factor which evaluates an immediate gain as being
more valuable than a gain at instants lying further in the future.

Different implementations of a strategy normally lead to different overall gains G:

    G = \sum_{k=0}^{\infty} e^{-\beta t_k}\, g(x_{t_k}, \omega_k, a_{t_k}).  (21)

The aim is to maximize the expectation of the overall gain G in accordance
with the following rule J:

    J = E\left\{ \sum_{k=0}^{\infty} e^{-\beta t_k}\, g(x_{t_k}, \omega_k, a_{t_k}) \right\},    (22)

it being possible to set the risk that the overall gain G of a specific
implementation of the access control and routing strategy falls below the expectation.



The TD(λ)-learning method is used to carry out the access control and the
routing.



The following target function is used in this embodiment:

    J^{*}(x_t) = E_{\tau}\left\{ e^{-\beta \tau} \right\} E_{\omega}\left\{ \max_{a \in A} \left[ g(x_t, \omega_t, a) + J^{*}(x_{t+1}) \right] \right\},    (23)

• A denoting an action space with a prescribed number of actions which are
respectively available in a state x_t,

• τ denoting a first instant at which a first event ω occurs, and

• x_{t+1} denoting a subsequent state of the system.

An approximated value of the target value J*(x t ) is learned and stored by 
employing a function approximator 400 (compare Figure 4) with the use of training 
data. 

Training data are data previously measured in the communications network 
300 and relating to the behavior of the communications network 300 in the case of 
incoming connection requests 304 and of termination of messages. This time 
sequence of states is stored, and these training data are used to train the function 
approximator 400 in accordance with the learning method described below. 

Each input 401, 402, 403 of the function approximator 400 receives as its input
variable the number of connections of one type of service m on one route of
the communications network 300. These are represented in Figure 4 by blocks 404,
405, 406. An approximated target value J̃ of the target value J* is the output
variable of the function approximator 400.

Figure 5 shows a detailed representation of a function approximator 500, which
has several component function approximators 510, 520.



One output variable is the approximated target value J̃, which is formed in
accordance with the following rule:

                                                                             (24)

The input variables of the component function approximators 510, 520, which
are present at the inputs 511, 512, 513 of the first component function approximator
510, or at the inputs 521, 522 and 523 of the second component function
approximator 520, are, in turn, respectively the number of connections of one type of
service m on a physical connection r in each case, symbolized by blocks 514, 515, 516
for the first component function approximator 510, and 524, 525 and 526 for the second
component function approximator 520.

Component output variables 530, 531, 532, 533 are fed to an adder unit 540,
and the approximated target value J̃ is formed as the output variable of the adder
unit.

Let it be assumed that the communications network 300 is in the state x_{t_k} and
that a request message, with which a type of service of class m is requested for a
connection between two nodes i, j, reaches the first switching unit 301a.

A list of permitted routes between the nodes i and j is denoted by R(i, j), and a
list of all possible routes is denoted by

    R(i, j, x_{t_k}) \subset R(i, j)                                         (25)

as the subset of the routes R(i, j) which could implement a possible connection with
regard to the available and requested bandwidth.

For each possible route r, r ∈ R(i, j, x_{t_k}), a subsequent state
x_{t_{k+1}}(x_{t_k}, ω_k, r) is determined which results from the fact that the connection
request 304 is accepted and the connection on the route r is made available to the
requesting first terminal 303.

This is illustrated in Figure 1 as step 102, the state of the system and the
respective event being determined in step 101. A route r* to be selected
is determined in step 103 in accordance with the following rule:

    r^{*} = \arg\max_{r \in R(i, j, x_{t_k})} \tilde{J}\left( x_{t_{k+1}}(x_{t_k}, \omega_k, r), \Theta_t \right).    (26)

A check is made in step 104 as to whether the following rule is fulfilled:

    c(m) + \tilde{J}\left( x_{t_{k+1}}(x_{t_k}, \omega_k, r^{*}), \Theta_t \right) < \tilde{J}(x_{t_k}, \Theta_t).    (27)

If this is the case, the connection request 304 is rejected in step 105; otherwise
the connection is accepted and "switched through" to the node j along the selected
route r* in step 106.

Weights of the function approximator 400, 500, which are adapted in the
TD(λ)-learning method to the training data, are stored in a parameter vector Θ for an
instant t in each case, such that an optimized access control and an optimized
routing are achieved.
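Steps 103 to 106, that is the route selection and the admission check of rules (26) and (27), can be sketched in Python as follows. The state representation (a tuple of used bandwidth units per physical connection), the helper next_state and the stand-in evaluation j_approx are assumptions made purely for illustration; in the embodiment the evaluation is supplied by the trained function approximator.

```python
def next_state(state, route, demand=1):
    """Hypothetical successor state: add `demand` units on every link of the route."""
    return tuple(u + demand if l in route else u for l, u in enumerate(state))


def j_approx(state):
    """Stand-in for the approximated target value; penalizes heavily loaded links."""
    return -float(sum(u * u for u in state))


def decide(state, feasible_routes, gain):
    """Return the selected route r*, or None if the request is rejected."""
    if not feasible_routes:
        return None
    # Rule (26): pick the route whose successor state is evaluated best.
    best = max(feasible_routes, key=lambda r: j_approx(next_state(state, r)))
    # Rule (27): reject if accepting is worth less than staying in the current state.
    if gain + j_approx(next_state(state, best)) < j_approx(state):
        return None
    return best


state = (1, 3)                    # used bandwidth on two hypothetical links
routes = [(0,), (1,)]             # each route is a tuple of link indices
accepted = decide(state, routes, gain=5.0)
rejected = decide(state, routes, gain=2.0)
```

With this stand-in evaluation, the route over the lightly loaded link is preferred, and the request is only accepted when the gain c(m) outweighs the drop in the evaluated state.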



During the training phase, the weighting parameters are adapted to the 
training data applied to the function approximator. 



A risk parameter κ is defined with the aid of which a desired risk, which the
system has with regard to a prescribed state owing to a sequence of actions and
states, can be set in accordance with the following rules:

    -1 < κ < 0:   risky learning,

    κ = 0:        neutral learning with regard to the risk,

    0 < κ < 1:    risk-avoiding learning,

    κ = 1:        worst-case learning.
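To make these regimes concrete, the following worked example evaluates the risk control function N^κ(ξ) = (1 - κ sign(ξ)) ξ, as used in the adaptation rules, for a temporal difference of ξ = +1 and of ξ = -1. The particular values of κ are chosen only for illustration.

```latex
\begin{align*}
N^{\kappa}(+1) &= 1 - \kappa, & N^{\kappa}(-1) &= -(1 + \kappa), \\
\kappa = -0.5: \quad N^{\kappa}(+1) &= 1.5, & N^{\kappa}(-1) &= -0.5, \\
\kappa = 0: \quad N^{\kappa}(+1) &= 1, & N^{\kappa}(-1) &= -1, \\
\kappa = 0.5: \quad N^{\kappa}(+1) &= 0.5, & N^{\kappa}(-1) &= -1.5, \\
\kappa = 1: \quad N^{\kappa}(+1) &= 0, & N^{\kappa}(-1) &= -2.
\end{align*}
```

For κ > 0, negative temporal differences are weighted more heavily than positive ones; at κ = 1, positive temporal differences are ignored entirely, which matches the designation as worst-case learning.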

Furthermore, a prescribable parameter 0 < λ < 1 and a step size sequence γ_k
are prescribed in the learning method.



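The weighting values of the weighting vector Θ are adapted to the training
data on the basis of each event ω_{t_k} in accordance with the following adaptation
rule:

    \Theta_k = \Theta_{k-1} + \gamma_k\, N^{\kappa}(d_k)\, z_k,              (28)

in which case

    d_k = e^{-\beta (t_k - t_{k-1})} \left( g(x_{t_k}, \omega_k, a_{t_k}) + \tilde{J}(x_{t_k}, \Theta_{k-1}) \right) - \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}),    (29)

    z_k = \lambda\, e^{-\beta (t_{k-1} - t_{k-2})}\, z_{k-1} + \nabla_{\Theta} \tilde{J}(x_{t_{k-1}}, \Theta_{k-1}),    (30)

and

    N^{\kappa}(\xi) = (1 - \kappa\, \mathrm{sign}(\xi))\, \xi.               (31)

It is assumed that z_{-1} = 0.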
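The function

    g(x_{t_k}, \omega_k, a_{t_k})                                            (32)

denotes the immediate gain in accordance with the following rule:

    g(x_{t_k}, \omega_k, a_{t_k}) =
    \begin{cases}
      c(m) & \text{when } \omega_{t_k} \text{ is a service request for a type of service } m \text{ and the connection is accepted,} \\
      0    & \text{otherwise.}
    \end{cases}                                                              (33)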
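The event-driven adaptation of the weighting vector can be sketched in Python as follows. The sketch assumes a linear approximator J̃(x, Θ) = Θ · x and a temporal difference of the form d_k = e^{-β(t_k - t_{k-1})} (g + J̃(x_{t_k})) - J̃(x_{t_{k-1}}), with the trace discounted by the preceding inter-event time; these forms, the function names and all numbers are assumptions of this illustration.

```python
import math


def risk_transform(d, kappa):
    """Risk control function N^kappa(d) = (1 - kappa * sign(d)) * d."""
    sign = (d > 0) - (d < 0)
    return (1.0 - kappa * sign) * d


def event_update(theta, z, x_prev, x_curr, gain, dt_curr, dt_prev,
                 step, lam, beta, kappa):
    """One update per event; dt_curr = t_k - t_{k-1}, dt_prev = t_{k-1} - t_{k-2}."""
    def j(s):
        return sum(t * si for t, si in zip(theta, s))    # assumed linear approximator

    d = math.exp(-beta * dt_curr) * (gain + j(x_curr)) - j(x_prev)
    # For the linear approximator the gradient with respect to theta is x_prev.
    z_new = [lam * math.exp(-beta * dt_prev) * zi + xi for zi, xi in zip(z, x_prev)]
    scaled = step * risk_transform(d, kappa)
    theta_new = [t + scaled * zi for t, zi in zip(theta, z_new)]
    return theta_new, z_new, d


theta1, z1, d1 = event_update(theta=[0.0, 0.0], z=[0.0, 0.0],
                              x_prev=[1.0, 0.0], x_curr=[0.0, 1.0], gain=2.0,
                              dt_curr=1.0, dt_prev=0.0, step=0.1, lam=0.5,
                              beta=math.log(2.0), kappa=0.0)
```

With β = ln 2 and one time unit between the events, the gain of 2 is discounted to a temporal difference of 1; with κ = 0 the update is risk-neutral.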



Thus, as described above, a sequence of actions is determined with regard to
a connection request such that a connection request is either rejected or accepted
on the basis of an action. The determination is performed taking account of an
optimization function in which the risk can be set by means of a risk control
parameter κ ∈ [-1, 1] in a variable fashion.

Figure 6 shows an embodiment of the present invention in relation to a traffic
management system.

Automobiles 601, 602, 603, 604, 605 and 606 are being driven on a road 600.
Conductor loops 610, 611 integrated into the road 600 receive electric
signals in a known way and feed the electric signals 615, 616 to a computer 620 via
an input/output interface 621. In an analog-to-digital converter 622 connected to the
input/output interface 621, the electric signals are digitized into a time series and
stored in a memory 623, which is connected by a bus 624 to the analog-to-digital
converter 622 and a processor 625. Via the input/output interface 621, a traffic
management system 650 is fed control signals 651 from which it is possible to set a
prescribed speed stipulation 652 in the traffic management system 650, or else
further particulars of traffic regulations, which are displayed via the traffic
management system 650 to drivers of the vehicles 601, 602, 603, 604, 605 and 606.

The following local state variables are used in this case for the purpose of
traffic modeling:

• traffic flow speed v,

• vehicle density ρ (ρ = number of vehicles per kilometer, vehicles/km),

• traffic flow q (q = number of vehicles per hour, vehicles/h; q = v · ρ), and

• speed restrictions 652 displayed by the traffic management system 650 at an
instant in each case.



The local state variables are measured as described above by using the 
conductor loops 610, 611. 

These variables (v(t), ρ(t), q(t)) therefore represent a state of the technical
system "traffic" at a specific instant t.

In this embodiment, the system is therefore a traffic system which is
controlled by using the traffic management system 650, and an extended Q-learning
method is described as the method of approximative dynamic programming.

The state x_t is described by a state vector

    x(t) = (v(t), \rho(t), q(t)).                                            (34)

The action a_t denotes the speed restriction 652, which is displayed at the
instant t by the traffic management system 650. The gain r(x_t, a_t, x_{t+1}) describes the
quality of the traffic flow which was measured between the instants t and t+1 by the
conductor loops 610 and 611.

In this embodiment, r(x_t, a_t, x_{t+1}) denotes

• the average speed of the vehicles in the time interval [t, t + 1],
or

• the number of vehicles which have passed the conductor loops 610 and 611
in the time interval [t, t + 1],
or

• the variance of the vehicle speeds in the time interval [t, t + 1],
or

• a weighted sum of the above variables.

A value of the optimization function OFQ is determined for each possible 
action a t , that is to say for each speed restriction which can be displayed by the 
traffic management system 650, an estimated value of the optimization function OFQ 
being realized in each case as a neural network. 

This results in a set of evaluation variables for the various actions a t in the 
system state x t . Those actions a t for which the maximum evaluation variable OFQ 
has been determined in the current system state x t are selected in a control phase 
from the possible actions a t , that is to say from the set of the speed restrictions which 
can be displayed by the traffic management system 650. 

In accordance with this embodiment, the adaptation rule, known from the
Q-learning method, for calculating the optimization function OFQ is extended by a risk
control function N^κ(·), which takes account of the risk.

In turn, the risk control parameter κ is prescribed in accordance with the
strategy from the first exemplary embodiment in the interval [-1, 1], and
represents the risk which a user wishes to run in the application with regard to the
control strategy to be determined.



The following evaluation function OFQ is used in accordance with this
exemplary embodiment:

    OFQ = Q(x; w^{a}),                                                       (35)

• x = (v; ρ; q) denoting a state of the traffic system,

• a denoting a speed restriction from the action space A of all speed restrictions
which can be displayed by the traffic management system 650, and

• w^a denoting the weights of the neural network which belong to the speed
restriction a.

The following adaptation step is executed in Q-learning in order to determine
the optimum weights w^a of the neural network:

    w_{t+1}^{a_t} = w_t^{a_t} + \eta_t \cdot N^{\kappa}(d_t) \cdot \nabla Q(x_t; w_t^{a_t})    (36)

using the abbreviation:

    d_t = r(x_t, a_t, x_{t+1}) + \gamma \max_{a \in A} Q(x_{t+1}; w_t^{a}) - Q(x_t; w_t^{a_t}),    (37)

• x_t, x_{t+1} denoting in each case a state of the traffic system in accordance with
rule (34),

• a_t denoting an action, that is to say a speed restriction which can be displayed
by the traffic management system 650,

• γ denoting a prescribable reduction factor,

• w_t^{a_t} denoting the weighting vector belonging to the action a_t, before the
adaptation step,

• w_{t+1}^{a_t} denoting the weighting vector belonging to the action a_t, after the
adaptation step,

• η_t (t = 1, ...) denoting a prescribable step size sequence,

• κ ∈ [-1, 1] denoting a risk control parameter,

• N^κ denoting a risk control function N^κ(ξ) = (1 - κ sign(ξ)) ξ,

• ∇Q( ; ) denoting the derivative of the neural network with respect to its
weights, and

• r(x_t, a_t, x_{t+1}) denoting a gain upon the transition in state from the state x_t to the
subsequent state x_{t+1}.

An action a t can be selected at random from the possible actions a t during 
learning. It is not necessary in this case to select the action a t which has led to the 
largest evaluation variable. 

The adaptation of the weights has to be performed in such a way that not only is 
a traffic control achieved which is optimized in terms of the expectation of the 
optimization function, but that also account is taken of a variance of the control 
results. 

This is particularly advantageous since the state vector x(t) models the actual
system of traffic only inadequately in some aspects, and so unexpected disturbances
can thereby occur. Thus, the dynamics of the traffic, and therefore of its modeling,
depend on further factors such as weather, proportion of trucks on the road,
proportion of mobile homes, etc., which are not always integrated in the measured
variables of the state vector x(t). In addition, it is not always ensured that the road
users immediately implement the new speed instructions in accordance with the
traffic management system.

A control phase on the real system in accordance with the traffic management
system takes place in accordance with the following steps:

1. The state x_t is measured at the instant t at various points in the traffic system
and yields a state vector x(t) := (v(t), ρ(t), q(t)).

2. A value of the optimization function is determined for all possible actions a_t,
and that action a_t with the highest evaluation in the optimization function is
selected.
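The two steps of the control phase can be sketched in Python as follows. The trained neural networks are replaced here by simple stand-in functions, one per displayable speed restriction; these stand-ins, the function name select_speed_limit and the measured values are hypothetical and serve only to show the selection of the action with the highest evaluation.

```python
def select_speed_limit(state, q_estimates):
    """Return the speed restriction whose estimated evaluation is highest in `state`."""
    return max(q_estimates, key=lambda a: q_estimates[a](state))


# Stand-ins for the per-action networks: each prefers a measured speed v close to its
# own speed restriction. A real system would use the trained approximators instead.
q_estimates = {a: (lambda x, a=a: -abs(x[0] - a)) for a in (60, 80, 100)}

v, rho = 80.0, 30.0                # measured speed (km/h) and density (vehicles/km)
state = (v, rho, v * rho)          # state vector (v, rho, q) with q = v * rho
chosen = select_speed_limit(state, q_estimates)
```

During learning, by contrast, the action may also be drawn at random from the possible actions, as noted for the training phase.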

Although modifications and changes may be suggested by those skilled in the 
art to which this invention pertains, it is the intention of the inventors to embody 
within the patent warranted hereon all changes and modifications that may 
reasonably and properly come under the scope of their contribution to the art.